class: center, middle, inverse, title-slide

# Introduction to R for Data Analysis
## Data Visualization - Part 2
### Johannes Breuer & Stefan Jünger
### 2021-08-05

---
layout: true

---

## Content of this session

.pull-left[
**What we will do**

- Visual checks of model assumptions
- Plotting coefficients
- Plotting predictions
- Plotting multiple models
]

.pull-right[
**What we won't do**

- Again, no crazy models
- No Bayesian statistics
]

---

## Packages in this session

.pull-left[
We will again rely on packages from the `easystats` collection

- `parameters`
- `performance`
- **`see`**
]

.pull-right[
<img src="img/logo_wall.png" width="1600" style="display: block; margin: auto;" />
]

.pull-left[
While `easystats` is straightforward to use, it still has some limitations. From time to time, we will therefore also switch to the `sjPlot` package.
]

.pull-right[
<img src="img/sjplot_logo.png" width="160" style="display: block; margin: auto;" />
]

---

## Fictional research question

In this session, we will investigate whether older people have higher levels of trust in society's institutions. We will also look at whether politically right-leaning people tend to have lower levels of trust.

What might be interesting then is how these two variables interact with each other.

**This 'research' is purely based on ad-hoc hypotheses.**

---

## Estimating a linear model (OLS)

We start by estimating a linear regression model including our variables of interest and two control variables.
```r
linear_model <- lm(
  mean_trust ~ age_cat + sex + education_cat + pol_leaning_cat,
  data = gp_covid
)

library(parameters)

model_parameters(linear_model)
```

```
## Parameter               | Coefficient |       SE |         95% CI | t(3106) |      p
## ------------------------------------------------------------------------------------
## (Intercept)             |        3.52 |     0.08 | [ 3.37,  3.67] |   46.55 | < .001
## age_cat                 |        0.03 | 5.25e-03 | [ 0.02,  0.04] |    6.13 | < .001
## sex                     |        0.09 |     0.02 | [ 0.05,  0.14] |    3.87 | < .001
## education_cat           |        0.02 |     0.02 | [-0.02,  0.05] |    0.89 | 0.374 
## pol_leaning_cat [left]  |        0.05 |     0.03 | [-0.01,  0.10] |    1.72 | 0.085 
## pol_leaning_cat [right] |       -0.28 |     0.05 | [-0.38, -0.17] |   -5.21 | < .001
```

---

## Quick inspection using `base R`

In this session, we won't show `base R` operations. But you could use `base R` for some basic model inspections, too!

.pull-left[
```r
par(mfrow = c(2, 2))
plot(linear_model)
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]

---

## Using `easystats`

`easystats` provides some really nice features for inspecting the model fit. Here's a simple check for the normality of the residuals: a function from the `performance` package is combined with `see` to plot the result.

.pull-left[
```r
library(performance)
library(see)

check_normality(linear_model) %>% 
  plot()
```
]

.pull-right[
```
## Warning: Non-normality of residuals detected (p < .001).
```

<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---

## QQ-plot (`easystats`)

You can also draw some classic QQ-plots for this purpose.

.pull-left[
```r
check_normality(linear_model) %>% 
  plot(type = "qq")
```
]

.pull-right[
```
## Warning: Non-normality of residuals detected (p < .001).
```

<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

---

## Heteroscedasticity (`easystats`)

Something that is always a bit of an issue in regression analysis is heteroscedasticity. Again, checking for it is straightforward.

.pull-left[
```r
check_heteroscedasticity(
  linear_model
) %>% 
  plot()
```
]

.pull-right[
```
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
```

<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

---

## Check outliers (`easystats`)

And the same applies to outlier analysis.

.pull-left[
```r
check_outliers(linear_model) %>% 
  plot()
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]

---

## There is more

The collection of `easystats` packages offers [way more procedures](https://easystats.github.io/see/articles/performance.html) you can apply to your models.

To replicate the summary of `base R`, you could use this command:

.pull-left[
```r
check_model(linear_model)
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
]

---

class: center, middle

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_4_2_1_Plotting_Diagnostics.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_4_2_1_Plotting_Diagnostics.html)

---

## Getting to the substantive part

After checking the model assumptions, and pretending everything's fine (it's not...), we turn to the substantive part of our fictional research question.

The traditional way of doing that is printing some regression tables and checking the individual coefficients.
That's fine, but plotting is also nice. We will now turn to

- Coefficient plots / forest plots
- Prediction plots

---

## Coefficient plot (`easystats`)

`easystats` provides a really easy method of creating a coefficient plot in one or two lines.

.pull-left[
```r
library(parameters)

model_parameters(linear_model) %>% 
  plot()
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]

---

## Changing labels: it's `ggplot2`-based

All of the plots in `easystats` are based on `ggplot2`. This means that, after calling the `easystats` function, we can add new plot layers or manipulate existing ones using the usual `+` structure.

.pull-left[
```r
library(ggplot2)

model_parameters(linear_model) %>% 
  plot() +
  scale_x_discrete(
    labels = # in reversed order!
      c(
        "Pol. Right (Ref. Center)",
        "Pol. Left (Ref. Center)",
        "Education",
        "Sex",
        "Age"
      )
  )
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]

---

## Making it more dim and scientific

Thus, we can easily manipulate the appearance, e.g., by turning to a less colorful theme.

.pull-left[
```r
library(ggplot2)

model_parameters(linear_model) %>% 
  plot() +
  scale_x_discrete(
    labels = # in reversed order!
      c(
        "Pol. Right (Ref. Center)",
        "Pol. Left (Ref. Center)",
        "Education",
        "Sex",
        "Age"
      )
  ) +
  scale_colour_grey(start = 0, end = 0) +
  guides(color = "none")
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]

---

## The advantage of prediction plots

Regression coefficients are sometimes a bit hard to read

- they are based on the scale of the independent variable when unstandardized
- it is ambiguous what they actually mean substantively
  - for example, is a coefficient of .2 or .05 quite substantive or not?
Predictions are a solution

- they basically draw the regression line into the scatter plot of the Y and X variables
- we can read off which value of Y is expected at which level of X

---

## Predictions (`sjPlot`)

For the purpose of plotting predictions, we will use the `plot_model()` function from the `sjPlot` package. In principle, this would also be possible with `easystats`, but I find `sjPlot` easier to work with in this case...

.pull-left[
```r
library(sjPlot)

linear_model %>% 
  plot_model(
    type = "pred",
    terms = "age_cat"
  )
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]

---

## Again, it's `ggplot2`-based

.pull-left[
```r
linear_model %>% 
  plot_model(
    type = "pred",
    terms = "age_cat"
  ) +
  xlab("Age") +
  ylab("Trust") +
  ggtitle("Linear Model") +
  theme_bw()
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]

---

## Handy when it comes to interaction effects

Let's now turn to the interaction between age and political leaning. Do right-wing political attitudes moderate the relationship between age and trust?
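As an aside, the expected values behind such prediction plots can also be computed by hand with base R's `predict()`. This is only a sketch: the covariate values in `new_data` below are arbitrary assumptions for illustration, not taken from the slides.

```r
# Sketch: expected trust at selected age categories, holding the other
# covariates fixed at hypothetical values (assumptions, not from the slides).
new_data <- data.frame(
  age_cat         = c(1, 5, 10),
  sex             = 1,
  education_cat   = 2,
  pol_leaning_cat = "center"
)

# Returns the predicted Y value (with a confidence interval) for each row
predict(linear_model, newdata = new_data, interval = "confidence")
```

This is exactly what `plot_model(type = "pred")` does under the hood, just across the whole range of the focal variable.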
```r
linear_model_interaction <- lm(
  mean_trust ~ age_cat + sex + education_cat + pol_leaning_cat +
    age_cat:pol_leaning_cat,
  data = gp_covid
)

model_parameters(linear_model_interaction)
```

```
## Parameter                         | Coefficient |       SE |         95% CI | t(3104) |      p
## ----------------------------------------------------------------------------------------------
## (Intercept)                       |        3.49 |     0.08 | [ 3.33,  3.64] |   44.06 | < .001
## age_cat                           |        0.04 | 6.41e-03 | [ 0.02,  0.05] |    5.78 | < .001
## sex                               |        0.09 |     0.02 | [ 0.05,  0.14] |    3.82 | < .001
## education_cat                     |        0.02 |     0.02 | [-0.02,  0.05] |    0.94 | 0.347 
## pol_leaning_cat [left]            |        0.21 |     0.08 | [ 0.05,  0.36] |    2.63 | 0.009 
## pol_leaning_cat [right]           |       -0.60 |     0.17 | [-0.94, -0.26] |   -3.45 | < .001
## age_cat * pol_leaning_cat [left]  |       -0.02 |     0.01 | [-0.05,  0.00] |   -2.16 | 0.031 
## age_cat * pol_leaning_cat [right] |        0.05 |     0.02 | [ 0.00,  0.09] |    1.92 | 0.055 
```

---

## Interaction plot

Regression parameters of interactions, in particular, are hard to interpret. Again, `sjPlot` helps.

.pull-left[
```r
linear_model_interaction %>% 
  plot_model(
    type = "int"
  ) +
  xlab("Age") +
  ylab("Trust") +
  ggtitle("Linear Model") +
  theme_bw()
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]

---

class: center, middle

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_4_2_2_Plotting_a_Regression.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_4_2_2_Plotting_a_Regression.html)

---

## Estimating multiple models

Something you may often face in your research is the estimation of multiple models because of

- changing covariates or dependent variables
- different model specifications

We can compare models side by side using regression tables, but by now we should know: plotting is better.
Let's create a binary variable using a median-split approach ([don't try this at home](https://dx.doi.org/10.1198/000313008X366226)).

```r
gp_covid <-
  gp_covid %>% 
  mutate(
    mean_trust_median =
      ifelse(
        mean_trust <= median(mean_trust, na.rm = TRUE), 0, 1
      )
  )
```

---

## Logistic regressions

We can then run a logistic regression based on the binary variable from the median split.

```r
logistic_model <- glm(
  mean_trust_median ~ age_cat + sex + education_cat + pol_leaning_cat,
  data = gp_covid,
  family = binomial(link = "logit")
)

model_parameters(logistic_model)
```

```
## Parameter               | Log-Odds |   SE |         95% CI |     z |      p
## ---------------------------------------------------------------------------
## (Intercept)             |    -1.16 | 0.23 | [-1.61, -0.72] | -5.09 | < .001
## age_cat                 |     0.09 | 0.02 | [ 0.06,  0.12] |  5.81 | < .001
## sex                     |     0.17 | 0.07 | [ 0.03,  0.31] |  2.32 | 0.020 
## education_cat           |     0.05 | 0.06 | [-0.06,  0.16] |  0.86 | 0.391 
## pol_leaning_cat [left]  |     0.09 | 0.08 | [-0.07,  0.25] |  1.15 | 0.251 
## pol_leaning_cat [right] |    -0.66 | 0.17 | [-1.00, -0.33] | -3.88 | < .001
```

---

## Linear probability model

Alternatively, we can fit a linear regression model to the binary outcome, which yields a linear probability model. Its regression coefficients can be interpreted as increases or decreases on a linear probability scale.
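To make that probability-scale interpretation concrete, here is a quick back-of-the-envelope sketch. The coefficient value used below is purely illustrative, not taken from the slides.

```r
# Sketch: in a linear probability model, a coefficient of 0.02 for age_cat
# would mean that each additional age category is associated with a
# ~2 percentage point higher probability of above-median trust.
# Implied change when moving from age category 1 to age category 10:
0.02 * (10 - 1)  # 0.18, i.e. about 18 percentage points
```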
```r
linear_probability_model <- lm(
  mean_trust_median ~ age_cat + sex + education_cat + pol_leaning_cat,
  data = gp_covid
)

model_parameters(linear_probability_model)
```

```
## Parameter               | Coefficient |       SE |         95% CI | t(3106) |      p
## ------------------------------------------------------------------------------------
## (Intercept)             |        0.22 |     0.05 | [ 0.11,  0.33] |    3.96 | < .001
## age_cat                 |        0.02 | 3.82e-03 | [ 0.01,  0.03] |    5.87 | < .001
## sex                     |        0.04 |     0.02 | [ 0.01,  0.08] |    2.32 | 0.020 
## education_cat           |        0.01 |     0.01 | [-0.01,  0.04] |    0.86 | 0.392 
## pol_leaning_cat [left]  |        0.02 |     0.02 | [-0.02,  0.06] |    1.16 | 0.247 
## pol_leaning_cat [right] |       -0.15 |     0.04 | [-0.23, -0.08] |   -3.94 | < .001
```

---

## Comparing model fits (`easystats`)

`easystats` gives you a nice spider plot when comparing the performance of the models.

.pull-left[
```r
library(performance)

compare_performance(
  linear_model,
  logistic_model,
  linear_probability_model
) %>% 
  plot()
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
]

---

## Comparing model parameters (`sjPlot`)

To compare the models' regression coefficients, I suggest once again relying on `sjPlot`. This time, instead of `plot_model()`, we use `plot_models()`.

.pull-left[
```r
plot_models(
  linear_model,
  logistic_model,
  linear_probability_model,
  std.est = "std"
)
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]

---

## Adjusting it within the function

The nice thing about `sjPlot` is that it provides a lot of options to manipulate the plot within the function itself.
For example, to change the legend and axis labels, we can use the `m.labels` and `axis.labels` arguments.

.pull-left[
```r
plot_models(
  linear_model,
  logistic_model,
  linear_probability_model,
  std.est = "std",
  m.labels = c(
    "Linear",
    "Logistic",
    "Linear Probability"
  ),
  axis.labels = c(
    "Pol. Right (Ref. Center)",
    "Pol. Left (Ref. Center)",
    "Education",
    "Sex",
    "Age"
  )
)
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]

---

## But it's still a `ggplot2` object

In the end, it is also based on `ggplot2`. Thus, we can again use `ggplot2` layers to add more elements to our plot.

.pull-left[
```r
plot_models(
  linear_model,
  logistic_model,
  linear_probability_model,
  # std.est = "std",
  m.labels = c(
    "Linear",
    "Logistic",
    "Linear Probability"
  ),
  axis.labels = c(
    "Pol. Right (Ref. Center)",
    "Pol. Left (Ref. Center)",
    "Education",
    "Sex",
    "Age"
  )
) +
  theme_bw()
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]

---

## Plotting predictions from multiple models

Neither `easystats` nor `sjPlot` provides an easy way to plot predictions from multiple models. For this purpose, we have to dig a bit deeper into the wrangling of the underlying data.

Fortunately, `sjPlot` supports extracting the prediction data from a model. With these data at hand, we can simply create a data frame comprising the predictions from multiple models.

---

## Extracting the predictions

The procedure for extracting the predictions from a model is similar to using the plotting command.
But instead of `plot_model()`, we use `get_model_data()`.

```r
linear_predictions <- 
  get_model_data(
    linear_model,
    term = "age_cat",
    type = "pred"
  )

linear_predictions
```

```
## # Predicted values of mean_trust
## 
## age_cat | Predicted | group_col |       95% CI
## ----------------------------------------------
##       1 |      3.73 |         1 | [3.67, 3.79]
##       2 |      3.76 |         1 | [3.71, 3.82]
##       3 |      3.80 |         1 | [3.75, 3.84]
##       4 |      3.83 |         1 | [3.79, 3.87]
##       6 |      3.89 |         1 | [3.86, 3.92]
##       7 |      3.92 |         1 | [3.89, 3.95]
##       8 |      3.96 |         1 | [3.92, 3.99]
##      10 |      4.02 |         1 | [3.97, 4.07]
## 
## Adjusted for:
## * sex = 1.49
## * education_cat = 2.48
## * pol_leaning_cat = center
```

---

## Predictions from the other two models

In the same way, we can extract the predictions from the logistic regression and the linear probability model.

.pull-left[
```r
logistic_predictions <- 
  get_model_data(
    logistic_model,
    term = "age_cat",
    type = "pred"
  )
```
]

.pull-right[
```r
linear_probability_predictions <- 
  get_model_data(
    linear_probability_model,
    term = "age_cat",
    type = "pred"
  )
```
]

---

## Combining the predictions

.pull-left[
We combine the data using simple data wrangling operations.

- `bind_rows()` is a `dplyr` function for appending rows from one dataset to another
  - there's also `rbind()` in `base R`
- we use `mutate()` to create an indicator that tells us which rows belong to which model's predictions
]

.pull-right[
```r
library(dplyr)
library(tibble)

all_predictions <-
  bind_rows(
    linear_predictions %>% 
      mutate(
        model = "Linear Model"
      ),
    logistic_predictions %>% 
      mutate(
        model = "Logistic Model"
      ),
    linear_probability_predictions %>% 
      mutate(
        model = "Linear Probability Model"
      )
  ) %>% 
  as_tibble()
```
]

---

## Plotting them with `facet_wrap()`

To create multiple prediction plots with a subplot for each model, we use barebones `ggplot2`. Yet, it is not necessarily more complex than using `easystats` or `sjPlot`.
.pull-left[
```r
ggplot(
  all_predictions,
  aes(x = x, y = predicted)
) +
  geom_line() +
  geom_line(aes(y = conf.high), linetype = "dashed") +
  geom_line(aes(y = conf.low), linetype = "dashed") +
  facet_wrap(~model) +
  theme_bw()
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
]

---

## Plotting only the binary models

As the plot is not particularly pretty because of the different scales of the dependent variables, we can also base it on a filtered version of the predictions data frame.

.pull-left[
```r
only_two <-
  ggplot(
    all_predictions %>% 
      filter(model != "Linear Model"),
    aes(x = x, y = predicted)
  ) +
  geom_line() +
  geom_line(aes(y = conf.high), linetype = "dashed") +
  geom_line(aes(y = conf.low), linetype = "dashed") +
  facet_wrap(~model) +
  theme_bw()

only_two
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
]

---

## What about average marginal effects (AME)?

In the previous session, we also learned about average marginal effects

- parameter estimates based on the 'real' empirical values of all other covariates
- handy when relationships are non-linear
- way easier to interpret than the usual regression coefficients
- also nice when used for predictions

---

## The same logic for real AME

We can also use the `margins` package to extract prediction information for our models. While the resulting column names are different, the `cplot()` function also gives us information about X-values, estimates, and confidence bands.
```r
library(margins)

logistic_model_AME <-
  cplot(
    logistic_model,
    what = "prediction",
    draw = FALSE
  )

linear_probability_model_AME <-
  cplot(
    linear_probability_model,
    what = "prediction",
    draw = FALSE
  )
```

---

## Combining AME data

After extracting the information, we can simply combine the two datasets again to get predictions for all included models.

```r
all_AME <-
  bind_rows(
    logistic_model_AME %>% 
      mutate(model = "Logistic Model"),
    linear_probability_model_AME %>% 
      mutate(model = "Linear Probability Model")
  ) %>% 
  as_tibble()
```

---

## Plotting predictions based on AME

Plotting them can then be done the same way as before. Please note the different variable names compared to the `sjPlot` procedure.

.pull-left[
```r
only_two_AME <-
  ggplot(
    all_AME,
    aes(x = xvals, y = yvals)
  ) +
  geom_line() +
  geom_line(aes(y = upper), linetype = "dashed") +
  geom_line(aes(y = lower), linetype = "dashed") +
  facet_wrap(~model) +
  theme_bw()

only_two_AME
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />
]

---

## Combining them with previous predictions

Using the fabulous `patchwork` package, we can even plot the 'simple' predictions together with the AME ones in one plot. Theoretically, this would also have been possible by combining the data...

.pull-left[
```r
library(patchwork)

(only_two +
    ggtitle("Simple Predictions") +
    ylab("Predicted Trust Value") +
    xlab("Age")) /
  (only_two_AME +
     ggtitle("AME Predictions") +
     ylab("Predicted Trust Value") +
     xlab("Age"))
```
]

.pull-right[
<img src="4_2_Data_Visualization_Part_2_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />
]

---

## What is more?
---

class: center, middle

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_4_2_3_Combining_Predictions.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_4_2_3_Combining_Predictions.html)

---

## Extracurricular activities